Many of the figures and analysis methodology were based off of this tutorial from NYU. https://learn.gencore.bio.nyu.edu/rna-seq-analysis/gene-set-enrichment-analysis/

Inputs Received

Information received from the csv samplesheet:

1. Chen & Phillips Analysis

Metadata information received from the tsv table:

##   replicate strain
## 1      rep1  hrde1
## 2      rep2  hrde1
## 3      rep3  hrde1
## 4      rep1  hrde2
## 5      rep2  hrde2
## 6      rep3  hrde2
## 7      rep1     wt
## 8      rep2     wt
## 9      rep3     wt

Analysis Methods

Here, count data from the RNA-seq experiment is read in the form of a counts matrix. Each column holds data from one sample, and each row represents a gene, such that the i-th row and n-th column tells how reads of gene i were measured in sample n. The values received should be un-normalized counts of sequencing reads or fragments.

# # Read Count Matrix
count.matrix <- read_tsv(trimws(sample_params$salmon_merged_gene_counts_file_path)) %>%
  mutate(across(3:ncol(.), ~ as.integer(.x))) %>%
  column_to_rownames("gene_id")

gene.reference <- dplyr::select(count.matrix, 1)

count.matrix <- count.matrix %>%
  dplyr::select(-1)

Information entered as metadata describes the samples (columns) of the count matrix. This metadata is combined with the sample/column names from the count matrix so the accuracy of the metadata can be reviewed. Take time to review the table below.

# Design ColData Matrix
sample <- colnames(count.matrix)
colData <- data.frame(sample)

# Add Columns for Condition to colData
for (i in colnames(metadata)){
  colData[i] <- metadata[i]
}

colData <- column_to_rownames(colData, "sample")
##            replicate strain 
## hrde1_rep1 "rep1"    "hrde1"
## hrde1_rep2 "rep2"    "hrde1"
## hrde1_rep3 "rep3"    "hrde1"
## hrde2_rep1 "rep1"    "hrde2"
## hrde2_rep2 "rep2"    "hrde2"
## hrde2_rep3 "rep3"    "hrde2"
## wt_rep1    "rep1"    "wt"   
## wt_rep2    "rep2"    "wt"   
## wt_rep3    "rep3"    "wt"

Using the count matrix, column metadata, and user-inputed design formula (expressing the variables to be used in modeling), a DESeq data set is created.

The experimental variable and the associated control/experimental conditions are given by the user in “ref_level_cond” and “ref_level_value” in the sampleInfo file respectively.

Pre-filtering is performed, keeping only genes that have a count of at least 10 in a minimum number of samples. The minimum number of samples is decided by calculating the number of times the reference level value (control condition) is listed in the metadata with the assumption that experimental conditions will be repeated the same number of times.

Standard differential expression analysis is performed with the DESeq function, and regularized log transformation (rlog) transforms count data to a log2 scale for PCA.

# Create DESeq Data Set
dds <- DESeqDataSetFromMatrix(
  countData = count.matrix,
  colData = colData,
  design = as.formula(sample_params$design)
)

# Set Reference Level
reflevelcond <- trimws(sample_params$ref_level_cond)
reflevelvalue <- trimws(sample_params$ref_level_value)
dds[[reflevelcond]] <- relevel(dds[[reflevelcond]], ref = reflevelvalue)

# Subset out genes with low overall counts
smallest.group.size <- length(grep(sprintf("^%s$",reflevelvalue), metadata[[reflevelcond]]))
keep <- rowSums(counts(dds) >= 10) >= smallest.group.size
dds <- dds[keep,]

# Run DESeq
dds <- DESeq(dds)

# Regularized Log Transform
rld <- rlog(dds)

If input data is designated to be batch corrected, the ComBat_seq() function from the SVA Bioconductor package is used. The batch (condition) to be regressed out is given in sampleInfo.csv as “batch_cond”, and the sample information for this condition is taken from the metadata.

The group argument of ComBat_seq() specifies biological covariates, whos signals should be preserved in adjusted data. All conditions (columns) from the metadata which are not to be regressed out are passed to the group argument. Differential expression analysis is otherwise performed as described above on batch-corrected counts.

if (as.logical(trimws(sample_params$batch_correct))) {
  
  # Check that condition for batch correction matches column metadata
  if(all(trimws(sample_params$batch_cond) != colnames(metadata))){
    stop(sprintf("The given condition for batch correction:%s, doesn't match any prior conditions: %s", sample_params$batch_cond, colnames(metadata)))
  }
  
  # Replicate / Batch Correction
  batch <- trimws(sample_params$batch_cond)
  
  batch.correct <- ComBat_seq(
    counts = as.matrix(count.matrix),
    batch = metadata[[batch]],
    group = metadata[[colnames(metadata)[colnames(metadata) != batch]]]) %>%
    as.data.frame()

  # Create DESeq Data Set
  batch.correct.dds <- DESeqDataSetFromMatrix(
    countData = batch.correct,
    colData = colData,
    design = as.formula(sample_params$design))
  
  # Set Factor Level
  reflevelcond <- trimws(sample_params$ref_level_cond)
  reflevelvalue <- trimws(sample_params$ref_level_value)
  batch.correct.dds[[reflevelcond]] <- relevel(batch.correct.dds[[reflevelcond]], ref = reflevelvalue)
  
  # Subset out genes with low overall counts
  smallest.group.size <- length(grep(sprintf("^%s$",reflevelvalue), metadata[[reflevelcond]]))
  keep <- rowSums(counts(batch.correct.dds) >= 10) >= smallest.group.size
  batch.correct.dds <- batch.correct.dds[keep,]
  
  # Run DESeq
  batch.correct.dds <- DESeq(batch.correct.dds)
  
  # Regularized Log Transform
  batch.correct.rld <- rlog(batch.correct.dds)
}

Reference level condition and value (ref_level_cond and ref_level_value respectively) inputs from sampleInfo.csv are used to create contrast terms to compare treatment samples with control samples. This relies on the design formula being formatted correctly to produce comparisons relevant to the experimental condition.

# # Find relevant contrasts
contrasts <- list()

# Factor values of experimental condition
pattern <- factor(metadata[[reflevelcond]])
comp <- levels(pattern)

# Remove control state from experimental condition vector
comp <- comp[comp != reflevelvalue]

# Create Contrasts
for (i in 1:length(comp)){
  contrasts[[i]] <- c(reflevelcond, comp[i], reflevelvalue)
}

Based on the number of experimental conditions (identified in the ref_level_cond column of the metadata) and the contrast terms comparing them to the experimental control (ref_level_value), differential expression results are extracted from the DESeq data set object.

# Find results from relevant contrasts
results <- list()

# Pull results from DESeq object with contrasts
for (i in 1:length(contrasts)){
  results[[i]] <- list()
  results[[i]][["DataFrame"]] <- results(dds.for.analysis, contrast = contrasts[[i]]) %>%
    data.frame() %>%
    rownames_to_column("gene_id") %>%
    mutate(gene_name = gene.reference[gene_id, 1], .before=baseMean) %>% # Add gene names to DESeq results matrices
    column_to_rownames("gene_id")
  
  results[[i]][["DESeqDataSet"]] <- results(dds.for.analysis, contrast = contrasts[[i]])
}

Principle Component Analysis

The Bioconductor package PCAtools is used to perform principle component analysis on regularized log transformed count data. Scree plots show principle component numbers on the x-axis, and their respective eigenvalues on the y-axis.

The third graph plots experimental metadata against a number of PCs and their gene loadings to visualize the agreement with the gene expression pattern of each condition for each PC.

No Regression

Replicate Regressed Out

strain_hrde1_vs_wt

Inputs for function call:

compileReport(result.obj = results[[i]], contrast = sprintf("%s_%s_vs_%s", 
    contrasts[[i]][1], contrasts[[i]][2], contrasts[[i]][3]), 
    sample_name = trimws(sample_params$sample_name), l2fc_filter = 0.8, 
    padj_filter = 0.1) 

DEG Report

Ontology Report

Transfering WORMBASE gene IDs to ENTREZID:

308 gene IDs were not transfered from WORMBASE to ENTREZID 
12108 gene IDs were successfully transfered from WORMBASE to ENTREZID 

Number of highly variable genes identified (with | L2FC | > 0.8 and padj < 0.1) is:

495

Number of highly upregulated genes identified (with L2FC > 0.8 and padj < 0.1) is:

377

Number of highly downregulated genes identified (with L2FC < -0.8 and padj < 0.1) is:

118

Dysregulated

Gene Ontology GSEA

protein phosphorylation

phosphorylation

dephosphorylation

ribonucleoprotein complex biogenesis

tRNA metabolic process

collagen and cuticulin-based cuticle development

male gamete generation

mitochondrial translation

mitochondrial gene expression

RNA modification

rRNA processing

mitochondrial respiratory chain complex assembly

defense response to symbiont

immune response

immune system process

cellular respiration

establishment of protein localization to membrane

developmental process involved in reproduction

neuropeptide signaling pathway

cell fate commitment

Reactome Pathway GSEA

Mitochondrial translation termination

Translation

Mitochondrial translation

Mitochondrial translation elongation

SRP-dependent cotranslational protein targeting to membrane

GTP hydrolysis and joining of the 60S ribosomal subunit

Formation of a pool of free 40S subunits

Nonsense Mediated Decay (NMD) independent of the Exon Junction Complex (EJC)

Physiological factors

Nonsense-Mediated Decay (NMD)

Nonsense Mediated Decay (NMD) enhanced by the Exon Junction Complex (EJC)

Metabolism of RNA

Upregulated

GO Classification
GO Over-representation
Reactome Pathway Over-representation

Downregulated

GO Classification
GO Over-representation
Reactome Pathway Over-representation

Table Reactome Pathway Over-representation Analysis of Downregulated Genes has no rows.

strain_hrde2_vs_wt

Inputs for function call:

compileReport(result.obj = results[[i]], contrast = sprintf("%s_%s_vs_%s", 
    contrasts[[i]][1], contrasts[[i]][2], contrasts[[i]][3]), 
    sample_name = trimws(sample_params$sample_name), l2fc_filter = 0.8, 
    padj_filter = 0.1) 

DEG Report

Ontology Report

Transfering WORMBASE gene IDs to ENTREZID:

308 gene IDs were not transfered from WORMBASE to ENTREZID 
12108 gene IDs were successfully transfered from WORMBASE to ENTREZID 

Number of highly variable genes identified (with | L2FC | > 0.8 and padj < 0.1) is:

1557

Number of highly upregulated genes identified (with L2FC > 0.8 and padj < 0.1) is:

1051

Number of highly downregulated genes identified (with L2FC < -0.8 and padj < 0.1) is:

506

Dysregulated

Gene Ontology GSEA

dephosphorylation

protein phosphorylation

phosphorylation

defense response to bacterium

male gamete generation

neuropeptide signaling pathway

sexual reproduction

mitochondrial transmembrane transport

Reactome Pathway GSEA

Table Reactome Pathway GSE Of All Differentially Expressed Genes has no rows.

Upregulated

GO Classification
GO Over-representation
Reactome Pathway Over-representation

Downregulated

GO Classification
GO Over-representation

Reactome Pathway Over-representation